Abstract

There have been recent surprising reports that whole genes can evolve de novo from noncoding sequences. This would be 
extraordinary if the noncoding sequences were random with respect to amino acid identity. However, if the noncoding 
sequences were previously translated at low rates, with the most strongly deleterious cryptic polypeptides purged by 
selection, then de novo gene origination would be more plausible. Here we analyze Saccharomyces cerevisiae data on 
noncoding transcripts found in association with ribosomes. We find many such transcripts. Although their average ribosomal 
densities are lower than those of protein-coding genes, a significant proportion of noncoding transcripts nevertheless have 
ribosomal densities comparable to those of coding genes. Most show increased ribosomal association in response to 
starvation, as has been previously reported for other noncoding sequences such as untranslated regions and introns. In rich 
media, ribosomal association is correlated with start codons but is not usually consistent and contiguous beyond that, 
suggesting that translation occurs only at low rates. One transcript contains a 28-codon open reading frame, which we 
name RDT1, which shows evidence of translation, and may be a new protein-coding gene that originated de novo from 
noncoding sequence. But the bulk of the ribosomal association cannot be attributed to unannotated protein-coding genes. 
Our primary finding of extensive ribosome association shows that a necessary precondition for selective purging is met, 
making de novo gene evolution more plausible. Our analysis is also proof of principle of the utility of ribosomal profiling data 
for the purpose of gene annotation. 
Introduction

Protein-coding sequences found only in a single species, fam- 
ily, or lineage are known as ORFans (Fischer and Eisenberg 
1999). Several mechanisms have been proposed for the or- 
igin of apparent ORFans (Long et al. 2003; Kaessmann et al. 
2009). These include mechanisms by which coding sequen- 
ces give rise to ORFans, for example, through gene duplica- 
tion (including via retrotransposition) followed by rapid 
divergence, through horizontal gene transfer from an unchar- 
acterized source, or through gene fusion/fission. More radi- 
cally, ORFans also arise de novo from noncoding sequences 
(Tautz and Domazet-Loso 2011).

BSC4 in Saccharomyces cerevisiae is a remarkable exam- 
ple of a protein-coding gene that evolved de novo via a series 
of point mutations in noncoding sequence (Cai et al. 2008). 
Although at first sight this seems extraordinary, because
random polypeptides are unlikely to fold stably (Dobson
1999; Bloom et al. 2007), genome-wide surveys suggest 
that de novo gene birth from noncoding sequences may 
not be so rare (Zhou et al. 2008; Tautz and Domazet-Loso 
2011). In addition to BSC4, cases have also been proteomically
confirmed in humans (Knowles and McLysaght 2009; Li,
Zhang, et al. 2010) and indirectly inferred through fusion constructs
for a second open reading frame (ORF) in yeast (Li,
Dong, et al. 2010). Cases have been inferred via expression
analyses in Drosophila (Chen et al. 2007), Arabidopsis (Donoghue
et al. 2011), and rice (Xiao et al. 2009), with protein-coding
status yet to be determined for these cases. Other cases have
been inferred bioinformatically in Drosophila (Levine et al.
2006; Begun et al. 2007), primates (Tay et al. 2009; Toll-Riera
et al. 2009), and Plasmodium vivax (Yang and Huang 2011).
On the smaller scale of parts of a gene, the conversion of 
noncoding sequence to coding can also occur through new
coding exons (Kondrashov and Koonin 2003; Sorek 2007; 
Lin et al. 2009) or incorporation of 3' untranslated regions 
(UTRs) (Giacomelli et al. 2007; Vakhrusheva et al. 2011) or
5' UTRs (Wilder et al. 2009) into coding regions. 

Conversion from noncoding to coding seems too unlikely 
an event to happen in a single evolutionary step. The se- 
quence in question must be transcribed, escape degradation 
at the nuclear exosome, associate with ribosomes, be trans- 
lated, and again escape degradation by the proteasome. Fi- 
nally, it must avoid toxic conformations such as amyloid, for 
example, in favor of a stable protein fold. 

At each stage, molecular errors in the present can provide 
a preview of mutations in the future (Whitehead et al. 2008; 
Masel and Trotter 2010; Rajon and Masel 2011). Selection
may purge from cryptic sequences those variants whose ex- 
pression is strongly and unconditionally deleterious, even 
when the sequences are expressed only at low levels via mo- 
lecular errors. This purging is predicted to increase evolvabil- 
ity substantially (Masel 2006; Rajon and Masel 2011). At 
first, this result seems surprising because evolution has no 
foresight. But whereas it is impossible to know what will 
be adaptive in the future, it is often possible to rule out what 
will 'not' be adaptive, such as toxic amyloid. The distribution 
of fitness effects of new mutations is strongly bimodal, with 
most mutations either being lethal or having a small effect 
size (Eyre-Walker and Keightley 2007; Fudala and Korona 
2009; Wylie and Shakhnovich 2011). If the cryptic lethals
are screened out, then whatever is left, by a process of elim- 
ination, has a greater chance of being adaptive than random 
sequences do. This is the cause of increased evolvability. Be- 
nign cryptic sequences that persist through a selective filter 
against low levels of erroneous expression can provide pre- 
selected raw material to be co-opted for the evolution of 
novelty (Masel 2006; Rajon and Masel 2011).

Here we focus on the evolutionary stage just before 
a noncoding sequence is co-opted as a new protein. The 
likely raw material for such co-option consists of transcripts
of unknown function that escape exonucleolytic degrada- 
tion (stable unannotated transcripts or SUTs; Jacquier 
2009) and associate with ribosomes. The occasional acci- 
dental translation of these transcripts, at low levels, could 
be enough to select against ORFs encoding toxic peptides. 
This preselection would enrich the raw material for those 
peptides most likely to be benign and so increase the likeli- 
hood of de novo gene birth. Because de novo gene birth is 
a real phenomenon in need of explanation, we predict am- 
ple preselected raw material. In other words, we predict that 
there are many noncoding transcripts associated with ribosomes
at high enough levels to be consistent with substantial
selection, purging from cryptic sequences those variants
whose translation would be strongly deleterious. 

Ingolia et al. (2009) profiled the positions of all complete
ribosomes bound to RNA, providing a snapshot of
translation. Ingolia et al. (2009) then analyzed patterns
of ribosomal binding within annotated protein-coding 
transcripts. Here we reanalyze the ribosomal profiling 
data, focusing on ribosomes bound to SUTs. An earlier 
case study looked at three SUTs and found that one 
of them, NMR026W, was associated with ribosomes 
(Thompson and Parker 2007). It was unclear whether this 
SUT was highly unusual or reasonably typical. Here we address
this question on a genome-wide basis and find that
ribosomal binding to SUTs not only occurs but is also, in 
agreement with our hypothesis, quite common. 

We find that most ribosomal binding of SUTs exhibits 
a strikingly different pattern from binding to coding sequen- 
ces. However, we find one clear exception, demonstrating 
a new example, only 28 amino acids long, where an ORF in 
S. cerevisiae with evidence of translation appears to have 
evolved recently. We call this transcript RDT1 for ribosomally
detected transcript. 

Materials and Methods 

The set of S. cerevisiae transcripts not containing annotated 
genes was downloaded from http://snyderlab.stanford.edu/Naga2008sup/novel_annotations.track.
Transcript information for all annotated ORFs was obtained from table S4 in
the supporting online material of Nagalakshmi et al. (2008); 
only those transcripts with well-defined UTRs were used in 
our analysis (4,419/6,604). Ribosome footprints and corre- 
sponding transcriptomes described by Ingolia et al. (2009) 
were obtained from the Gene Expression Omnibus (GEO) 
(http://www.ncbi.nlm.nih.gov/geo/) (GEO accession: 
GSE13750). These accessions include mappings of foot- 
prints to the yeast genome available from Saccharomyces 
Genome Database (SGD, http://www.yeastgenome.org/) 
on 22 June 2008. Only footprints that mapped uniquely 
to a single location in the genome without mismatches (a 
little more than 60% of the total) were used in our analysis. 
This yields a false discovery rate of essentially zero (Wang 
et al. 2009). 

Genome sequences for the orthologous intergenic region 
between SPB1 and KAR4 orthologs in other fungal species
were obtained from SGD and aligned using MUSCLE (Edgar
2004) followed by manual alignment. The SPB1 and KAR4
orthologs were used to anchor the alignment. Subsequent
alignment was then performed progressively inward until
converging on the region containing RDT1. Because the orthologous
regions in S. kudriavzevii and S. bayanus could not
be identified using the alignment, we searched the orthologous
intergenic sequence for a highly divergent ORF. We did
this using nucleotide position information for the entire orthologous
intergenic region between SPB1 and KAR4.
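
The search for ORFs in an intergenic sequence can be illustrated with a minimal Python sketch. It is not the procedure actually used here, which relied on manual inspection. The function name `find_orfs`, the definition of an ORF as running from the first in-frame ATG to the next in-frame stop codon, and the default threshold of more than 15 codons (borrowed from the figure legends) are all assumptions made for illustration. Only the three reading frames of the given strand are scanned.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 15) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for ATG-initiated ORFs on the given strand.

    start is the 0-based index of the A in ATG; end is the index just past
    the stop codon. Only ORFs with more than `min_codons` sense codons
    (the stop codon is not counted) are reported. Where several in-frame
    ATGs precede the same stop, only the ORF from the first ATG is kept.
    This ORF definition is an illustrative assumption.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons > min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)


# Toy sequence (hypothetical): a 17-codon ORF in reading frame 2.
toy = "CC" + "ATG" + "GCT" * 16 + "TAA" + "GG"
toy_orfs = find_orfs(toy)
```

Applying the same function to the reverse complement would cover the other strand; that step is omitted because only the Watson strand is considered here.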

The serial analysis of gene expression (SAGE) data set was
obtained from Affymetrix Yeast S98 arrays and provided by 
the lab of Allan Jacobson (He et al. 2003) and the GEO 
(http://www.ncbi.nlm.nih.gov/geo/) (accession number:
GSE2579) (Wyers et al. 2005). 

AUGCAI was calculated using the method described in
Miyasaka (1999). The transfer RNA (tRNA) copy numbers
for S. cerevisiae, S. paradoxus, and S. mikatae were obtained
from Scannell et al. (2011) and used to calculate
tRNA adaptation index (tAI) using the codonR software 
(http://people.cryst.bbk.ac.uk/~fdosr01/tAI/). 
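
As a rough illustration of the final step of a tAI calculation, the index of a sequence is the geometric mean of the relative adaptiveness values of its codons. The Python sketch below shows only that step and is not the codonR implementation. The function name `tai` and the example weights are hypothetical. Deriving real weights from tRNA copy numbers and wobble parameters, as codonR does, is not reproduced here, and codons without a weight (such as the stop codon) are simply skipped by assumption.

```python
import math
from typing import Dict


def tai(orf: str, weights: Dict[str, float]) -> float:
    """Geometric mean of per-codon relative adaptiveness values.

    `weights` maps codons to values in (0, 1]. Codons absent from
    `weights` are ignored, which is an assumption of this sketch.
    """
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    scored = [c for c in codons if c in weights]
    if not scored:
        raise ValueError("no codons with a defined weight")
    # Sum logs rather than multiplying, to avoid underflow on long ORFs.
    log_sum = sum(math.log(weights[c]) for c in scored)
    return math.exp(log_sum / len(scored))


# Hypothetical weights for two codons, for illustration only.
example_weights = {"GCT": 0.25, "AAA": 1.0}
example_tai = tai("GCTAAATAA", example_weights)
```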

Results 

For each of the two biological replicates of the two exper- 
imental conditions (rich and starved) described by Ingolia 
et al. (2009), we mapped ribosome footprints onto each 
of the 487 novel transcribed regions (SUTs) described by 
Nagalakshmi et al. (2008). Of the 404 SUTs for which Ingolia
et al. (2009) found evidence of RNA expression, 217 showed 
some ribosomal association (at least one mismatch-free hit 
mapping uniquely to that SUT) in at least one of the repli- 
cates, in comparison to 4,372 of 4,404 expressed, verified 
ORF-containing transcripts. 

Next we quantified the level of ribosomal association to 
produce a histogram of average ribosomal density per ribo- 
somally associated transcript (fig. 1). Ribosome association is 
not uncommon for SUTs and can occur at high frequency 
relative to messenger RNA (mRNA) concentration, especially 
but not exclusively in starved conditions (fig. 1). Although SUTs
have, on average, lower ribosomal densities than protein- 
coding genes do (P < 10"^^ for each of the four replicates, 
Welch two-sample t-test with unequal variance), many individual
SUTs have high levels of ribosomal association.
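
Ribosomal density, as used here, is the number of ribosome footprint mappings for a transcript divided by the number of mRNA mappings for the same transcript. A minimal Python sketch of this ratio follows. The function name `ribosomal_density` is hypothetical, and pooling the two replicates by summing their counts before taking the ratio is an assumption: the text states only that replicates were pooled. Transcripts with no footprints or no mRNA reads return `None`, mirroring their exclusion from the histogram.

```python
from typing import Optional, Sequence


def ribosomal_density(footprints: Sequence[int],
                      mrna: Sequence[int]) -> Optional[float]:
    """Footprint mappings per mRNA mapping, pooled over replicates.

    `footprints` and `mrna` hold one count per replicate. Pooling by
    summation is an assumption of this sketch.
    """
    total_fp = sum(footprints)
    total_mrna = sum(mrna)
    if total_fp == 0 or total_mrna == 0:
        return None  # no ribosomal association, or no expression evidence
    return total_fp / total_mrna


# Hypothetical counts for one transcript in two replicates.
density = ribosomal_density([12, 8], [40, 60])
```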

Next we produced traces of ribosomal association as 
a function of position along each transcript. Each time 
a footprint mapped to a nucleotide position, we incre- 
mented its occupancy by the tag count of the footprint. 
A typical SUT's ribosomal trace shows only a single peak 
(fig. 2A) or several, noncontiguous peaks (fig. 2D; supplementary
fig. S1, Supplementary Material online). Ribosomal
footprints that map to SUTs are 50% more likely to include 
an AUG triplet than an alternative NUN triplet in rich media 
(P < 10"^; contingency table) but are nonspecific with re- 
spect to triplet identity in starved conditions (P = 0.06). Note 
that it is difficult to know for sure whether ribosomal asso- 
ciation always leads to translation. In this regard, it must be 
noted that translation can occasionally initiate even in the 
absence of an AUG start codon (Ingolia et al. 2009). 
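
The construction of a ribosomal trace can be sketched in Python as follows. The representation of a footprint as a `(start, length, tag_count)` tuple with a 0-based start, and the function names, are assumptions made for illustration. The update rule is the one described above: every nucleotide covered by a footprint has its occupancy incremented by that footprint's tag count.

```python
from typing import Iterable, List, Tuple

# (start, length, tag_count); start is 0-based along the transcript.
Footprint = Tuple[int, int, int]


def occupancy_trace(transcript_length: int,
                    footprints: Iterable[Footprint]) -> List[int]:
    """Per-nucleotide ribosome occupancy along one transcript."""
    trace = [0] * transcript_length
    for start, length, tag_count in footprints:
        # Clip footprints that extend past either end of the transcript.
        lo = max(start, 0)
        hi = min(start + length, transcript_length)
        for pos in range(lo, hi):
            trace[pos] += tag_count
    return trace


def peak_occupancy(trace: List[int]) -> int:
    """Highest occupancy at any single position (0 for an empty trace)."""
    return max(trace, default=0)


# Two hypothetical overlapping footprints on a 10-nt transcript.
example_trace = occupancy_trace(10, [(2, 3, 4), (3, 4, 1)])
```

Because overlapping footprints are summed position by position, positions in the interior of a translated region accumulate more signal than positions at its edges.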

It is possible that some or even many of the SUTs with very 
high levels of ribosomal association are in fact short unan- 
notated protein-coding genes. We therefore looked for 
ORFs within SUTs that might be protein-coding sequences. 
We examined each of the SUT ribosomal association traces 
manually. We chose to do this manually because we were
interested in contiguity in addition to peak occupancy and 
have no validated a priori quantitative metric for contiguity. 
Fig. 1. Histogram of ribosomal densities for transcripts annotated
as protein coding and for other ribosomally associated transcripts
(SUTs). Although ribosomal densities are lower for SUTs, there is 
substantial overlap, with many SUTs associated with ribosomes at levels 
typical for protein-coding transcripts. SUT-ribosome associations in- 
crease under starved conditions, whereas associations with protein- 
coding transcripts do not. Ribosomal density is calculated as the number 
of ribosome footprint mappings for a given transcript normalized by the 
number of mRNA mappings for the same transcript in the same 
experimental replicate. We pooled the two replicates available for each 
of the two (rich vs. starved) conditions. Transcripts showing no 
ribosomal association were excluded: numbers for these are given in 
the text. 

Five transcripts have particularly intriguing ribosomal traces, 
with locations along the transcript having peak occupancy 
of 10 or more footprints in at least one of the four replicates.
For four of these transcripts, ribosomal occupancy did not 
correspond to an ORF and was much higher in starved con- 
ditions (supplementary fig. S2, Supplementary Material on- 
line). Increased association under starved conditions is 
typical for other noncoding sequences such as UTRs and 
introns (Ingolia et al. 2009). 

However, one transcript contained a 28 amino acid ORF 
whose position corresponded to the region of highest ribo- 
some occupancy relative to all other positions on that tran- 
script (fig. 2B and E). The transcript showed higher 
ribosomal association in rich media. This transcript had 
a higher total number of ribosomal hits than any of the 
486 other SUTs in both of the rich condition replicates 
and ranked 19th and 8th on this measure in the two starved
condition replicates. The start codon context adaptation index
(AUGCAI) is 0.32, which is well within the range of other
yeast mRNA (Miyasaka 1999; supplementary fig. S5A, Supplementary
Material online). These observations are all consistent
with translation as a protein-coding gene, rather
than merely occasional accidental translation. However, it
should also be noted that the tAI is 0.18, falling only just
within the range of other yeast mRNA (dos Reis et al.
2004; supplementary fig. S5B, Supplementary Material online).
We named this transcript RDT1, for ribosomally detected
transcript. RDT1 is located on the Watson strand
of chromosome III between positions 30768 and 31228.

Fig. 2. Ribosomal traces for RDT1, compared with a representative SUT and with a verified short protein-coding gene ATP19. We see that the
RDT1 traces show a very similar pattern to the protein-coding traces, whereas most other SUTs have dramatically different traces. Solid and dashed lines
indicate each of the two replicates for the given condition. Lines are drawn above the positions of ORFs longer than 15 codons: the example of a SUT
shown here contains multiple overlapping ORFs. The raw number of footprints per nucleotide is given in the 5'-3' direction of each transcript. The y
axis on the right of each figure is normalized for mRNA concentration; the y axis on the left is not. The otherwise representative SUT and protein-coding
gene shown were chosen because of their length similarity to RDT1.

We blasted RDT1 using BlastN on the nt/nr nucleotide da- 
tabase, and the only significant hits (e value < 1 0"^), other 
than the same location in S. cerevisiae (i.e., self-hits), were 
found in the syntenic region in S. paradoxus. We also blasted
the ORF sequence using the TBlastX algorithm on the nt/nr
database, in case nucleotide divergence had masked amino 
acid conservation with another species, perhaps one related 
only through horizontal gene transfer. Again, we found only 
self-hits. 

Through the inclusion of adjacent genes, we then forced 
an alignment of known syntenic sequences of other Saccha- 
romyces species (Byrne and Wolfe 2005; see Materials and 
Methods). Although nucleotide sequence identity is low, we 
can confirm sequence homology among S. cerevisiae, S. 
paradoxus, and S. mikatae (fig. 3); sequences from S. kudriavzevii
and S. bayanus were too divergent from these
three to be aligned reliably. The start codon is present in
the reference sequence of all three species; however, it is
followed almost immediately by a stop codon in the S. paradoxus
reference sequence. Saccharomyces mikatae does,
however, contain a homologous 20 amino acid ORF. We
looked in the syntenic region of S. kudriavzevii and S. bayanus
for any syntenic ORF too divergent to detect homology
but did not find a match (fig. 4).

To study polymorphism in RDT1, we downloaded 39 S.
cerevisiae and 36 S. paradoxus strains sequenced by the Saccharomyces
Genome Resequencing Project
(http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html, 2010 Sep).
Thirty-three S. cerevisiae strains share
the same ORF allele as the S288C reference strain, and three
strains (DBVPG6040, UWOPS83 787, and UWOPS87 2421)
share a second allele of the same ORF with three nucleotide
substitutions leading to two amino acid differences. The remaining
three strains (UWOPS05 217, UWOPS05 227, and
UWOPS03 461; this is the Malaysian cluster identified by Liti
et al. 2009) share these three nucleotide differences and
have two more, one of which abolishes the start codon
and hence the ORF (fig. 3). This shows that translation of
the RDT1 ORF is not essential in S. cerevisiae.
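
The allele comparison can be checked with a short Python sketch that translates the three S. cerevisiae sequences using the standard genetic code. The sequences below are the S. cer. A, B, and C rows of figure 3 with alignment gaps removed, transcribed from the figure as printed. The variable and function names are hypothetical, and the script is a consistency check on the figure, not part of the original analysis.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Gap-stripped sequences from fig. 3 (adjacent string literals are joined).
ALLELE_A = ("ATGATACGA" "CAGAAGATTTTTGTTTTT" "ATAGTTAAGTCAAGAAGA"
            "AATTCTATTTGTCCAGCAATCCGGCGCAAAGAAGACTACTAA")
ALLELE_B = ("ATGATACGA" "CAGAAGATTTTTGCTTTT" "ATAGTTAAGTCAAGAAGA"
            "AACTCTATTTGTCCAGCAATCCGGCGCAAAGAAGACCACTAA")
ALLELE_C = ("ATAATACGA" "CAGAAGATTTTTGCTTTT" "ATAGTTAAGTCAAGAAGA"
            "AACTCTATTTGTCCAGCAATCCGGTGCAAAGAAGACCACTAA")


def translate(seq: str) -> str:
    """Translate in frame 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def n_diff(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


protein_a = translate(ALLELE_A).rstrip("*")
protein_b = translate(ALLELE_B).rstrip("*")
nt_diff_ab = n_diff(ALLELE_A, ALLELE_B)
aa_diff_ab = n_diff(protein_a, protein_b)
nt_diff_bc = n_diff(ALLELE_B, ALLELE_C)
c_has_start = ALLELE_C.startswith("ATG")
```

Under these transcriptions, allele A encodes 28 amino acids, alleles A and B differ at three nucleotides and two amino acids, and allele C differs from B at two further nucleotides, one of which changes the ATG start codon to ATA.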

All but one of the S. paradoxus strains clearly lack the 
ORF. Twenty-five strains have a stop codon in the third codon
position, whereas 10 strains do not contain a start codon 
S. mik. ATGGTACAAACAAAAGAATTGCGTCTTT ATGTAAAG CGAAGAGAAAGTGAGTTTTCCC-AATAACCTACGGCAAAGAATACTACAAA

MVQTKELRL YVK RRESEFS Q* 

S. par. ATGATATGA-CAGAAGATTTTTGTTTTTTTTATATAAAG — GCGAAGAGAGAGTTCCTTCATTC-AGCAATCCGGCGCAAAGAACACTACGGG 

MI* 

S.cer. A ATGATACGA-CAGAAGATTTTTGTTTTT — ATAGTTAAGTCAAGAAGA AATTCTATTTGTCCAGCAATCCGGCGCAAAGAAGACTACTAA 

MIR QKIFVF IVKSRR NSICPAIRRKEDY* 
S.cer. B ATGATACGA-CAGAAGATTTTTGCTTTT — ATAGTTAAGTCAAGAAGA AACTCTATTTGTCCAGCAATCCGGCGCAAAGAAGACCACTAA 

MIR QKIFAF IVKSRR NSICPAIRRKEDH* 
S.cer C ATAATACGA-CAGAAGATTTTTGCTTTT — ATAGTTAAGTCAAGAAGA AACTCTATTTGTCCAGCAATCCGGTGCAAAGAAGACCACTAA 

IIR QKIFAF IVKSRR NSICPAIRCKEDH* 

S. parH ATGATACGA-CAGAAGATTTT-G TTTTTATATAAAA — GCCAAGAGAGAGAGAGTTCATTC-AGTAATCCGGTGCAAAGAACACTAAGAG 

MIR QKIL FLYK SQERESSF SNPVQRTL R... 

Fig. 3. Alignment of sequences homologous to RDT1 in Saccharomyces mikatae, S. paradoxus, and S. cerevisiae. Amino acids are given
underneath the center position of each putatively coding codon. Frameshifts mean that not all amino acid positions are homologous. Nucleotides that
match in any two species are highlighted. Polymorphism among the S. cerevisiae strains is underlined. S. cer. A corresponds to the most common allele,
also found in the SGD reference sequence, S. cer. B is the alternative putatively protein-coding allele (found in DBVPG6040, UWOPS83 787, and
UWOPS87 2421), and S. cer. C (found in the Malaysian strains UWOPS05 217, UWOPS05 227, and UWOPS03 461) does not contain a start codon. S.
par. H refers to UWOPS91 917.1, a Hawaiian strain of S. paradoxus.



within the plausible length of a homologous transcript (sup- 
plementary fig. S3, Supplementary Material online). Assum- 
ing that the apparent start codon of the one remaining 
strain, UWOPS91 917.1, is not merely the result of a sequencing
error, this strain has a homologous ORF 46 amino
acids long. This strain is highly divergent from other S. paradoxus
isolates and was sampled from a native plant in Hawaii
(Liti et al. 2009).

The start codon context adaptation index (AUGCAI) described
by Miyasaka (1999) was similar in S. cerevisiae
RDT1 (0.32), in the homologous ORF in the Hawaiian S. paradoxus
strain (0.32), and in the short ORF in S. mikatae
(0.35). The tAI values for the Hawaiian S. paradoxus homolog
and the S. mikatae homolog are slightly higher at 0.26
and 0.19, respectively, compared with 0.18 and 0.17 in the
two S. cerevisiae alleles.

Fig. 4. Synteny alignment of region including RDT1. (A) Synteny
alignment of the Watson strand in the ~2.5-kb intergenic region
between KAR4 and SPB1 and their orthologs in related species. The
position of RDT1 in Saccharomyces cerevisiae is shown, as is the
homologous ORF in S. mikatae. All ORFs >15 codons long, in all three
reading frames of the Watson strand (including overlapping ORFs), are
shown for S. bayanus and S. kudriavzevii. Horizontal position represents
distance (in nucleotides) from SPB1. (B) Known phylogenetic relationships
of Saccharomyces species used (Rokas et al. 2003).

Protein aggregation was predicted for the ORF using
TANGO (Fernandez-Escamilla et al. 2004). Surprisingly, an
aggregation-prone hexapeptide is strongly predicted for
both S. cerevisiae alleles and is weakly predicted in the single
ORF-containing S. paradoxus strain (fig. 5). However,
TANGO scores apply only to peptides in isolation and not
to entire proteins in context, and so this result does not necessarily
imply that RDT1 will aggregate. For example, RDT1
might form a homo-oligomer or a complex with other
proteins, in which the aggregation-prone segment is
sequestered deep within a protein fold. No aggregation propensity
was detected for the S. mikatae ORF.

Fig. 5. Protein aggregation propensity (%) along the length of
the putative protein, as predicted by TANGO. The y axis indicates the
estimated likelihood that a peptide would be found in an aggregated
structure rather than, for example, as an α-helix or β-sheet (Fernandez-Escamilla
et al. 2004). These probabilities are based on thermodynamic
stability and the Boltzmann equation and apply to the peptide in
isolation from the rest of the protein. Both Saccharomyces cerevisiae
RDT1 alleles have a predicted aggregation-prone sequence from amino
acids 5-11 inclusive. No aggregation is predicted for S. mikatae, whose
amino acids are not homologous in this region due to a frameshift. The
single strain of S. paradoxus that contained an ORF shows a weak
aggregation propensity at a position shifted by one amino acid.

Discussion 

We do not know whether RDT1 codes for a functional pro- 
tein: Its translation could be accidental rather than a product 
of adaptation. It is clearly not essential in S. cerevisiae, as it is 
absent in Malaysian isolates. Nevertheless, its origin would 
still be interesting as a possible intermediate along the 
pathway to de novo gene birth. 

There are two scenarios regarding the evolutionary origin 
of RDT1 as a protein-coding sequence. First, it may have 
evolved de novo on the branch leading to S. cerevisiae.

Second, RDT1 might already have been present as a pro- 
tein-coding gene in the common ancestor of S. cerevisiae 
and S. mikatae. In this scenario, it was then lost in most
or all the S. paradoxus lineages and also lost in the Malaysian 
S. cerevisiae lineage. The question is then whether it origi- 
nated de novo after divergence with S. bayanus or whether 
it is evolving so fast that recognizable homology to older lin- 
eages is lost, making it appear to be ORFan. With or without 
recognizable homology in the nucleotide sequence, there is 
no syntenic ORF in S. bayanus (fig. 4). There is syntenic overlap
with a much larger ORF in S. kudriavzevii but no indication
whatsoever of homology regardless of how we (manually) 
align the sequences to attempt to force a homologous match. 
For this reason, de novo origination is suggested, but not 
proved, by the homology data. 

The short length of RDT1 is also compatible with, but not 
proof of, its recent de novo origination. Recent de novo origination
on the S. cerevisiae branch would be further supported
if the homologous 20-amino acid S. mikatae ORF
were found not to be transcribed or if its transcript were
not ribosomally associated. However, it is difficult to obtain
conclusive proof of absence of transcription because transcription
may only occur under particular environmental
conditions that do not match those assayed in the laboratory.
Our finding of comparable codon adaptation indices in
S. mikatae is consistent with translation in that species 
but might just as easily be a simple product of chance or 
phylogenetic confounding. 

ORFs appearing by chance in SUTs are likely to be very 
short. Even after they have evolved to become functional 
proteins, they are likely to remain short for substantial pe- 
riods of evolutionary time. Most classical gene annotation 
methods exclude short ORFs (Basrai et al. 1997) because 
they often appear by chance alone and do not code for 
proteins. This means that proteins recently evolved de novo 
will be missed due to their short length. Other gene anno- 
tation methods rely on evolutionary conservation (Cliften 
et al. 2003; Kellis et al. 2003); obviously, these methods 
will also fail to annotate recently evolved de novo protein-coding
genes. The best methods to date for finding
short protein-coding genes are proteomic (Kim et al.
2009). Our approach represents a novel proteomic 
method, strongly suggesting that RDT1 is translated. This 
could be demonstrated more conclusively in the future by 
artificially expressing RDT1, validating a mass spectrometry
protocol to detect it in spiked yeast extracts and then as- 
saying native RDT1 peptide levels in yeast. 

We also used our method on an earlier SAGE "noncod- 
ing" data set used by Thompson and Parker (2007) (see Ma- 
terials and Methods for details) and identified multiple 
protein-coding genes not annotated at the time that the 
data set was produced (not shown). All these have since 
been annotated as protein coding. This suggests that ribo- 
somal profiling may be a powerful gene annotation method 
for taxa less well studied than S. cerevisiae. 

Note that although our method can detect shorter pro- 
teins than many other methods, we still have a detection 
threshold of minimum protein length. This is because we 
looked for contiguous ribosomal association, which is more 
striking for longer ORFs. In addition, because our hits do not 
have complete codon specificity, bias caused by overlap 
means that traces have stronger signals in their central re- 
gion and weaker signals at the edges (see supplementary 
fig. S4, Supplementary Material online, for an illustration). 
Very short translated ORFs would have a signal strength cor- 
responding to that found at edges and hence be harder to 
detect. 

We do not yet know whether the peptide encoded by 
RDT1 has been co-opted for a function or whether it is part 
of background evolutionary "noise." But what is really strik- 
ing is our more general finding of widespread ribosomal 
binding to SUTs. A high proportion of the noncoding ge- 
nome is transcribed into SUTs (David et al. 2006). Here 
we have shown that just over half of all SUTs are transported 
to the cytoplasm and bind there to ribosomes, especially at 
AUG codons. 

Although we do not know the extent to which this ribo- 
somal association leads to translation, these SUTs, apart 
from RDT1, do not appear to encode functional protein- 
coding genes. Given the extraordinarily low false discovery 
rate associated with RNA-Seq data (Wang et al. 2009), this 
supports the hypothesis that the high level of ribosome as- 
sociation is due to intrinsically error-prone molecular pro- 
cesses. 

This biological noise may ultimately and fortuitously facilitate
de novo gene birth (Rajon and Masel 2011). Short ORFs
appear frequently by chance and are then likely to be translated
by accident, at least at low levels. A low level of expression
is ideal for purging strongly deleterious
sequences, whereas benign sequences remain effectively 
neutral (Masel 2006; Rajon and Masel 2011). These low 
rates of accidental expression leading to preadaptive purg- 
ing could help provide the raw material for de novo birth of 
protein-coding genes. 
Supplementary Material

Supplementary figures S1-S5 are available at Genome Biology
and Evolution online (http://www.gbe.oxfordjournals.org/).